British Telecom (BT) is a multinational telecommunications company headquartered in London, England. It was founded in 1980 and has since grown to become one of the largest telecommunications companies in the world, providing a wide range of services including fixed-line, broadband, mobile, and TV products. The GPO Tower, as the BT Tower was originally known, opened in 1965 as a microwave hub for distributing terrestrial TV broadcasts to a network of aerial masts scattered around the country. The tower was an important part of the microwave relay that formed the backbone of the UK's telecommunications network at the time. When it opened, it could handle up to 150,000 telephone conversations and 40 television channels. That responsibility still resides within the now iconic BT Tower and is handled by BT Network’s Media & Broadcast team.
Some 95% of the TV-destined content within the UK is carried by networks that are managed by the staff in the BT Tower. Keeping BBC 1, Radio 4, and the World Service on air is seen as critical to the security of the United Kingdom.
The existing network, originally built in 2004, carried most of the generic UK TV broadcast traffic but was straining to support increased traffic each year and was reaching the end of its service life. This traffic includes incoming video feeds from football stadiums, rugby grounds, and horse racing tracks; services imported from around the world, such as international news reporting and events; and feeds between geographically distributed UK studios, along with the resulting output to transmitters. BT Media & Broadcast needed to build a replacement network quickly and within a tight budget.
In order to improve repeatability of service creation and to reduce operational costs, the BT Networks Media & Broadcast team wanted to leverage advances in software-defined wide area network (SD-WAN) technology to fully automate site provisioning, service ordering, and service assurance monitoring and reporting.
The project to build the physical infrastructure and to develop the orchestration system was named Vena. As an orchestration system, Vena manages, updates, and maintains software configurations for the SD-WAN system that is the backbone of BT’s broadcasting infrastructure. When requirements change or new broadcasting needs arise, they are entered via the Vena user interface, which then takes care of the network provisioning. Rather than buy an off-the-shelf solution, BT decided to build the software system from scratch using open source components. The TV industry has an ingrained history of responding to developing situations and often “cobbling” together solutions in real time to respond to unpredictable world events. This means the Vena team often has to accommodate unanticipated features within weeks, sometimes days, and on occasion within hours. This could not be achieved without full ownership of the software stack that underpins service delivery. BT also saw data streaming as a technology that would play a vital role in their tech stack. More on this later.
Building the system in-house gave BT the ability to develop a solution tightly tailored to their esoteric customer needs. Combined with an agile development methodology, it meant BT could deliver features as needed and respond immediately to “late in the day” emergent requirements.
Vena’s media services typically take a live video feed handed off as a digital signal, optionally compress it with a broadcast-quality compression algorithm, and convert it into a series of IP packets, which are then delivered over two diverse circuit paths to multiple destinations. One such application is the delivery of all the free-to-air (Freeview) channels to the hundred or so transmitters scattered across the British Isles.
Vena’s orchestration system is heavily dependent upon the ability to reliably distribute and respond to messages flowing between physical equipment, microservices, and user interactions within the Vena portal. The ordering of events is particularly crucial to correctly identify and automatically respond to network faults.
Naturally, data streaming using Apache Kafka® was a good fit as the underlying messaging system upon which to build Vena. However, BT needed ironclad resilience between their secure data centres, up-to-date security patches (they have to be highly secure due to the potential disruption which could occur if their systems were compromised), and a technology partner that they could rely on for guidance and to respond quickly if something did go wrong. BT found that partner in Confluent.
The Vena platform carries services for all major broadcasters in the UK. If these services go offline, BT incurs severe SLA penalties (up to millions) and risks severe brand damage.
The data presented here relates to network events and alarms, collected from around 3,000 devices distributed across the country. These devices are fairly chatty, and Vena expects to process up to 2,000 SNMP data events per minute. BT will soon look to add telemetry data into their Service Assurance processing stack, which could expand the required data throughput by an order of magnitude.
Therefore, Vena needed a service management stack that is highly available, scalable, and robust, and that allows operations teams to be proactive about faults in the network.
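To put those figures in perspective, the short calculation below converts them into per-second and per-device rates. It is only back-of-the-envelope arithmetic: it assumes the 2,000 events per minute figure is a peak across all 3,000 devices and reads "an order of magnitude" as exactly 10x, neither of which the text states precisely.

```python
# Figures quoted in the text.
DEVICES = 3_000
PEAK_EVENTS_PER_MINUTE = 2_000

# Assumption: "an order of magnitude" is read as a factor of 10.
TELEMETRY_GROWTH_FACTOR = 10

events_per_second = PEAK_EVENTS_PER_MINUTE / 60
events_per_device_per_minute = PEAK_EVENTS_PER_MINUTE / DEVICES
projected_events_per_second = events_per_second * TELEMETRY_GROWTH_FACTOR

print(f"Current peak:       {events_per_second:.1f} events/s")
print(f"Per device:         {events_per_device_per_minute:.2f} events/min")
print(f"With telemetry x10: {projected_events_per_second:.1f} events/s")
```

The average rate is modest. As discussed later, the harder problem is alarm storms, where many devices report at once.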
Data streaming enables:
Real-time data: Data streaming is ideal for real-time or near-real-time processing. It allows for continuous and immediate data updates, which suits applications like live analytics, monitoring, and instant notifications, and enables BT to present a real-time view of the network estate.
Scalability: Data streaming is often more scalable than traditional API methods. It can handle large volumes of data and scale horizontally to accommodate increased load, making it suitable for applications whose data volumes grow rapidly.
Event-driven architectures: Vena wanted to implement systems following an event-driven architecture, where data streaming is a natural choice. It fits well with the concept of handling events as they occur, making it easier to synchronize data across different systems.
Efficiency: If BT had to handle the volume of event data from the network and application estate with traditional APIs, client applications would often need to repeatedly poll for updates, resulting in higher network and server loads. Data streaming eliminates the need for frequent polling, reducing network and server overhead. It enables BT to publish a single message that any interested party can consume.
Availability and reliability: Given the nature of the services Vena carries, BT has 99.999% availability and reliability requirements, so ensuring that data is consistently delivered and processed without interruption is a must.
Not surprisingly, the choice between APIs and data streaming depends on the use case: BT also uses APIs where data updates are infrequent or less time-sensitive, or for one-off, point-in-time data retrieval.
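The "publish once, consume many times" idea from the efficiency point can be illustrated with a toy in-memory log. This is a minimal sketch and not the Kafka client API: `TopicLog`, `publish`, `read_new`, and the group names are all hypothetical, and a real topic adds partitioning, persistence, and replication.

```python
from collections import defaultdict


class TopicLog:
    """Toy stand-in for a single-partition topic (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._records = []                # append-only list of messages
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        """Append one message and return its offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read_new(self, group):
        """Return the messages this group has not seen yet and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:]
        self._offsets[group] = len(self._records)
        return batch


log = TopicLog()
log.publish({"device": "router-1", "event": "linkDown"})
log.publish({"device": "router-1", "event": "linkUp"})

# Two independent consumers each receive every message, published only once.
dashboard_view = log.read_new("dashboard")
alarm_store_view = log.read_new("alarm-store")
```

Because each consumer group tracks its own position, adding another consumer does not require the producer to send anything twice.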
De facto standard: Kafka was already being used across BT. Vena leveraged existing knowledge to implement best practices in design and implementation. There is also a sizable pool of engineers skilled in Kafka in the industry.
Community: There’s a large, active, and global user community with many conferences and events like Kafka Summit, where people share ideas, experiences, and best practices.
The connector ecosystem: The Kafka Connect framework allows customers to deploy ready-to-use “connectors” to consume and produce streams from and to databases and other applications.
Stream processing: Kafka Streams simplifies application development by abstracting a lot of the complexity involved in working with high-throughput event streams.
Throughput, latency, and scale: This is required to deal with alarm processing and alarm storms in the network. Kafka operates at network speeds and can cope with millions of events. This becomes crucial when dealing with alarms where every second spent processing counts and where alarm storms can easily cause a massive flood of events.
Real-time message manipulation and routing: Kafka Connect and Kafka Streams provide the capability to route, transform, and enrich network events as they happen.
High resilience: Confluent Platform clusters, deployed to tried and tested best-practice standards with self-balancing, allow BT to reduce operational effort and error.
Automated deployment: Deployment scripts using Terraform are used to spin up clusters and do zero downtime upgrades.
Data replication: Replicator and Cluster Linking are used for near-real-time geo-replication.
Confluent Control Center: Used for detailed monitoring of Confluent Platform and the data within topics.
Security: Confluent offers data encryption, bring your own key, audit logs, and fine-grained access control.
Support and maintenance expertise: Best-in-class support offered by Confluent enables the high uptime requirements of critical national
Using Confluent, Vena has been able to deliver two vital use cases that are at the core of the platform and are described below. The use cases demonstrate how Confluent interacts seamlessly with various types of data, such as low-level network events, network monitoring events, and business process level data. Using the Connect ecosystem, Confluent can also join disparate data sources and sinks, such as syslog, Couchbase, and Neo4j, in real time.
The Vena network consists of devices and connections that link two sites together. For each channel, two separate circuits must be configured that follow different geographical routes, to provide resilience in case of a physical failure affecting part of the line or part of the country. Once provisioned, the signal is sent along both routes and combined at the end site to provide resilience against packet loss or corruption.
These connections are provided by various connectivity providers. However, there is often a significant delay, typically ranging from weeks to sometimes months, before both ends of a connection are established. This presents a challenge, as engineers must remember to manually bring the link into service during the onboarding process every time a supplier establishes a connection.
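The text does not describe how the two routes are combined at the end site. One common way to picture it, sketched below, is to tag every packet with a sequence number and keep the first copy that arrives on either path. The function name, the tuple layout, and the use of sequence numbers are assumptions made for illustration only.

```python
def merge_diverse_paths(arrivals):
    """Yield each packet once, from whichever path delivered it first.

    `arrivals` is an iterable of (path, sequence_number, payload) tuples in
    arrival order. Later copies of an already-seen sequence number are dropped.
    """
    seen = set()
    for path, seq, payload in arrivals:
        if seq in seen:
            continue  # duplicate from the other path
        seen.add(seq)
        yield seq, payload, path


# Path A loses packet 2 and path B loses packet 3, yet nothing is missing.
arrivals = [
    ("A", 1, "p1"), ("B", 1, "p1"),
    ("B", 2, "p2"),
    ("A", 3, "p3"),
    ("A", 4, "p4"), ("B", 4, "p4"),
]
merged = list(merge_diverse_paths(arrivals))
```

A production receiver would also bound the `seen` set to a sliding window and reorder packets that arrive out of sequence; both are left out to keep the sketch short.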
To address this issue, event-driven microservices come to the rescue. When both ends of the link are successfully connected, the devices generate LLDP (Link Layer Discovery Protocol) events, which are then streamed to a syslog server. Using a syslog connector from Confluent, these messages are persisted to a topic. Once ingested into Confluent Platform, a consumer processes these messages and compares them for the devices at both ends of the link. If the messages match, it automatically triggers a Business Process Model and Notation (BPMN) workflow that runs through a series of network configurations and tests, ultimately bringing the link into service automatically.
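The matching step can be sketched as a small state machine that waits until each end of a link has reported the other. This is an illustrative sketch, not BT's implementation: the `LldpEvent` fields, the `LinkMatcher` class, and the callback that stands in for starting the BPMN workflow are all hypothetical, and parsing of the syslog messages is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LldpEvent:
    """Hypothetical parsed form of an LLDP neighbour message."""
    local_device: str
    local_port: str
    remote_device: str
    remote_port: str


class LinkMatcher:
    """Calls `on_link_ready` once both ends of a link have reported each other."""

    def __init__(self, on_link_ready):
        self._pending = {}  # (local_device, local_port) -> LldpEvent
        self._on_link_ready = on_link_ready

    def handle(self, event):
        far_end = (event.remote_device, event.remote_port)
        mirror = self._pending.get(far_end)
        if (
            mirror is not None
            and mirror.remote_device == event.local_device
            and mirror.remote_port == event.local_port
        ):
            # Both ends agree: hand over to the workflow (stand-in callback).
            del self._pending[far_end]
            self._on_link_ready(mirror, event)
            return True
        # Otherwise remember this end and wait for its counterpart.
        self._pending[(event.local_device, event.local_port)] = event
        return False


started = []
matcher = LinkMatcher(lambda a, b: started.append((a.local_device, b.local_device)))
first = matcher.handle(LldpEvent("site-a-sw1", "eth1", "site-b-sw1", "eth7"))
second = matcher.handle(LldpEvent("site-b-sw1", "eth7", "site-a-sw1", "eth1"))
```

Keeping the pending events keyed by device and port means the two reports can arrive weeks apart and in either order, which matches the supplier delays described above.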
This mechanism allows BT to quickly respond and provision new television circuits on demand for local news or sporting events, while minimizing manual effort and reducing potential for human error.
The second use case is the alarm pipeline, which is built as a series of Kafka Streams apps that process Simple Network Management Protocol (SNMP) events from the physical network estate through the lifecycle described below:
Raw SNMP alarms are processed by the Alarms Streams app which filters unwanted alarms, normalizes the message payload, and performs basic enrichment. These are further enriched by the Alarms K-Table app. These early steps are all about data quality and ensuring the right messages are present with enough contextual information to be useful later in the process.
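The filter, normalize, and enrich steps can be sketched as three plain functions chained together. The field names, the severity codes, and the contents of the lookup table are invented for illustration, and the sketch uses ordinary Python rather than the Kafka Streams DSL; in Kafka Streams terms the steps correspond roughly to a filter, a value mapping, and a stream-table join.

```python
# Hypothetical noise filter and severity codes.
UNWANTED_TYPES = {"heartbeat", "test"}
SEVERITY_MAP = {"1": "critical", "2": "major", "3": "minor"}

# Stands in for the lookup table used for enrichment (hypothetical contents).
DEVICE_TABLE = {
    "10.0.0.1": {"site": "London", "role": "encoder"},
}


def is_wanted(raw):
    """Filter step: drop alarm types the operations team never acts on."""
    return str(raw.get("type", "")).lower() not in UNWANTED_TYPES


def normalize(raw):
    """Normalize step: map the raw payload onto one consistent schema."""
    return {
        "device_ip": raw["src"].strip(),
        "alarm_type": raw["type"].lower(),
        "severity": SEVERITY_MAP.get(str(raw.get("sev")), "unknown"),
    }


def enrich(alarm):
    """Enrich step: add device context when the device is known."""
    return {**alarm, **DEVICE_TABLE.get(alarm["device_ip"], {})}


def process(raw_alarms):
    for raw in raw_alarms:
        if is_wanted(raw):
            yield enrich(normalize(raw))


raw_alarms = [
    {"src": " 10.0.0.1 ", "type": "linkDown", "sev": 1},
    {"src": "10.0.0.2", "type": "Heartbeat", "sev": 3},
]
clean_alarms = list(process(raw_alarms))
```

Here the heartbeat is discarded and the remaining alarm leaves the pipeline with a consistent shape plus its site and role, which is the context the later stages rely on.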
Next, the alarms are processed by the Correlated Alarms service, which performs service impact analysis. This identifies which customer services are impacted and the priority of the impact. Finally, the Fault Coordinator service looks at other active faults and performs root cause analysis. This enables Vena to present the most relevant alarms to the operational team and filter out the noise, resulting in swift remedial action. A Couchbase sink connector is used to persist the alarms in Couchbase, the long-term alarm store.
As alarms are processed, they also update the physical or logical state of the affected resource in the inventory, which is materialized in a Neo4j database that gives a real-time view of network topology. This is a critical record, as it is required to recover the service following any failures.
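The text does not say how impact and root cause are computed, so the following is only one plausible reading: impact analysis as a lookup from an alarming resource to the services riding on it, and root cause analysis as suppressing any alarm whose upstream resource is also alarming. The inventory maps, resource names, and priority scheme are all hypothetical.

```python
# Hypothetical inventory: services carried by each resource, and their priority.
SERVICES_BY_RESOURCE = {
    "link-1": ["news-feed", "sports-feed"],
    "encoder-9": ["sports-feed"],
}
SERVICE_PRIORITY = {"news-feed": 1, "sports-feed": 2}  # 1 = most urgent

# Hypothetical dependency map: resource -> the resource it depends on.
DEPENDS_ON = {"encoder-9": "link-1"}


def impact(alarm):
    """Attach the impacted services and the most urgent priority among them."""
    services = SERVICES_BY_RESOURCE.get(alarm["resource"], [])
    priority = min((SERVICE_PRIORITY[s] for s in services), default=None)
    return {**alarm, "services": services, "priority": priority}


def root_causes(alarms):
    """Keep only alarms with no alarming resource upstream of them."""
    faulted = {a["resource"] for a in alarms}

    def has_faulted_upstream(resource):
        visited = set()  # guards against accidental cycles in the map
        parent = DEPENDS_ON.get(resource)
        while parent is not None and parent not in visited:
            if parent in faulted:
                return True
            visited.add(parent)
            parent = DEPENDS_ON.get(parent)
        return False

    return [a for a in alarms if not has_faulted_upstream(a["resource"])]


active = [impact({"resource": "encoder-9"}), impact({"resource": "link-1"})]
shortlist = root_causes(active)
```

In this example the encoder alarm is treated as a symptom of the link fault, so the operations team would see a single alarm that already names both affected services.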
The use of Confluent as a data streaming platform allows BT Vena to use an automated system for provisioning television networks and bring live TV signals into homes in a fast, automated, and reliable way. It also provides the mechanism for monitoring and maintaining those networks to ensure ongoing service coverage.
Vena is currently deployed on BT’s private cloud estate, but the team is looking at moving Vena to Confluent Cloud to reduce the operational burden of managing Confluent Platform.
During the Vena development, the team has become skilled in building stream processing apps using KStreams and dealing with state stores. Connect has also proven a powerful tool for synchronizing real-time data from Confluent into storage and query engines such as Couchbase and Neo4j.